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Abstract 

Parsimony, including sparsity and low rank, has been shown to suc- 
cessfully model data in numerous machine learning and signal processing 
tasks. Traditionally, such modeling approaches rely on an iterative al- 
gorithm that minimizes an objective function with parsimony-promoting 
terms. The inherently sequential structure and data-dependent complex- 
ity and latency of iterative optimization constitute a major limitation in 
many applications requiring real-time performance or involving large-scale 
data. Another limitation encountered by these modeling techniques is the 
difficulty of their inclusion in discriminative learning scenarios. In this 
work, we propose to move the emphasis from the model to the pursuit 
algorithm, and develop a process-centric view of parsimonious modeling, 
in which a learned deterministic fixed-complexity pursuit process is used 
in lieu of iterative optimization. We show a principled way to construct 
learnable pursuit process architectures for structured sparse and robust 
low rank models, derived from the iteration of proximal descent algo- 
rithms. These architectures learn to approximate the exact parsimonious 
representation at a fraction of the complexity of the standard optimiza- 
tion methods. We also show that appropriate training regimes allow to 
naturally extend parsimonious models to discriminative settings. State-of- 
the-art results are demonstrated on several challenging problems in image 
and audio processing with several orders of magnitude speedup compared 
to the exact optimization algorithms. 

*P. Sprechmann and G. Sapiro are with the Department of Electrical and Computer 
Engineering, Duke University, Durham 27708, USA. Email: pablo.sprechmann@duke.edu, 
guillermo.sapiro@duke.edu. 

^A. M. Bronsteind is with School of Electrical Engineering, Tel Aviv University, Tel Aviv 
69978, Israel. Email: bron@cng.tau.ac.il. 

tWork partially supported by NSF, ONR, NGA, DARPA, AFOSR, ARO, and BSF. 



1 



1 Introduction 



Parsimony, preferring a simple explanation to a more complex one, is probably 
one of the most intuitive principles widely adopted in the modeling of nature. 
The past two decades of research have shown the power of parsimonious repre- 
sentation in a vast variety of applications from diverse domains of science. 

One of the simplest among parsimonious models is sparsity, asserting that 
the signal has many coefficients close or equal to zero when represented in some 
domain, usually referred to as dictionary. The pursuit of sparse representations 
was shown to be possible using tools from convex optimization, in particular, 
via ^1 norm minimization [TJ[2]. Works [Sill], followed by many others, intro- 
duced efficient computational techniques for dictionary learning and adaptation. 
Sparse modeling is in the heart of modern approaches to image enhancement 
such as denoising, demosaicing, impainting, and super-resolution, to mention 
just a few. 

As many classes of data are not described well by the element-wise sparsity 
model and the norm inducing it, more elaborate structured sparse models 
have been developed, in which non-zero elements are no more unrelated, but 
appear in groups or hierarchies of groups [71 |H1 IB EHj • Such models have 
been shown useful in the analysis of functional MRI and genetic data for exam- 
ple. 

In the case of matrix-valued data, complexity is naturally measured by the 
rank, which also induces a notion of parsimony. A recent series of works have elu- 
cidated the beautiful relationship between sparsity and low rank representations, 
showing that rank minimization can be achieved through convex optimization 
[iTJ [12] . The combination of low-rank and sparse models paved the path to new 
robust alternatives of principal component analysis (RPCA) [131 E! and non- 
negative matrix factorization (RNMF) [15 , and addressing challenging matrix 
completion problems |12j . RPCA was also found useful in important applica- 
tions such as face recognition and modeling, background modeling, and audio 
source separation. Another relevant low rank modeling scheme is non-negative 
matrix factorization (NMF)[TB], where the input vectors are represented as non- 
negative linear combination of a non-negative under-complete dictionary. NMF 
has been particularly successful in applications such as object recognition and 
audio processing. 
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1.1 Prom model-centric to process-centric parsimonious 
modeling 

All existing parsimonious modeling methods essentially follow the same pattern: 
First, an objective comprising a data fitting term and parsimony-promoting 
penalty terms is constructed; next, an iterative optimization algorithm is in- 
voked to minimize the objective, pursuing either the parsimonious represen- 
tation of the data in a given dictionary, or the dictionary itself. Despite its 
remarkable achievements, such a model- centric approach suffers from critical 
disadvantages and limitations. 

The inherently sequential structure and the data-dependent complexity and 
latency of iterative optimization tools often constitute a major computational 
bottleneck. The quest for efficiently solving sparse representation pursuit has 
given rise to a rich family of algorithms, both for sparse coding |T71 [T51 [THl [201 
[111122 , RPCA [13 [23121113 and NMF [Ulig problems. Despite the perma- 
nent progress reported in the literature, the state-of-the-art algorithms require 
hundreds or thousands of iterations to converge, making their use impractical 
in scenarios demanding real-time performance or involving large-scale data. 

Relying on the explicit solution of an optimization problem furthermore 
limits the applicability of parsimonious models in supervised learning scenarios, 
where the higher-level training objective would depend on the solution of the 
lower-level pursuit problem. The resulting bilevel optimization problems are 
notoriously difhcult to solve in general; the non-differentiability of the lower- 
level parsimony-inducing objective makes the solution practically impossible 
[27j . This partially explains why sparse representations, that are so widely 
adopted for the construction of generative models, had such a modest success 
in the construction of discriminative models. 

In this paper, we take several steps to depart from the model-centric ideology 
relying on an iterative solver by shifting the emphasis from the model to the 
pursuit process. Our approach departs from the observation that, despite being 
highly non-linear and hard to compute, the mapping between a data vector 
and its parsimonious representation resulting from the optimization procedure 
is deterministic. The curse of dimensionality precludes the approximation of 
this mapping on all possible, even modestly sized input vectors; however, since 
real data tend to have low intrinsic dimensionality, the mapping can be inferred 
explicitly on the support of the distribution of the input data. 

Recently, [211 1211 have proposed to trade off precision in the sparse repre- 
sentation for computational speed-up by learning non-linear regressors capable 
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of producing good approximations of sparse codes in a fixed amount of time. 
However, the large number of degrees of freedom, for which a good initiahza- 
tion is difficult to provide, made this effort only modestly successful. In their 
inspiring recent paper, [30] showed that a particular network architecture can 
be derived from the iterative shrinkage-thresholding (ISTA) [17] and proximal 
coordinate descent (CoD) algorithms [T^ . 

These works were among the first to bridge between the optimization based 
sparse models and the inherently process-centric neural networks, and in partic- 
ular auto-encoder networks [3TJ [32] , extensively explored by the deep learning 
community. 

1.2 Contributions 

The main contribution of this paper is to propose a comprehensive framework 
for process-centric parsimonious modeling. The obtained encoders can be used 
to produce fast approximants or predictors of optimization based parsimonious 
models or as modelers in their own right, this is, pursuit processes that might 
not be minimizing any specific objective function. Specifically, this paper makes 
four main contributions: 

First, in Section [4] we propose a process-centric approach to parsimonious 
modeling. We begin by proposing a principled way to construct encoders ca- 
pable of approximating an important family of parsimonious models (briefly 
described in Section [2]), including general sparse coding paradigms (hierarchi- 
cal and non-overlapping group sparsity), robust PCA, and NMF. By extending 
the original ideas in |30| , we propose tailored pursuit architectures derived from 
first-order proximal descent algorithms, which are briefly presented in Section[3j 
Note that unlike the standard sparse coding setting, the exact first-order RPCA 
and RNMF algorithms cannot be used directly, as each iteration involves the sin- 
gular value decomposition (SVD). As a remedy, we propose to use an algorithm 
inspired by the non-convex optimization techniques in pSj . 

Second, this new approach allows the encoders to be trained in an online 
manner, which makes the fast encoders no more restricted to work with a fixed 
distribution of input vectors known a priori (limitation existing, for example, 
in [30)). and removes the need to run the exact algorithms at training. The pro- 
posed approach can be used with a predefined dictionary or learn it in an online 
manner on the very same data vectors fed to it. While differently motivated, in 
this setting, our framework is related to the sparse autoencoders [35] . 

Third, we show that abandoning the iterative minimization in favor of a 
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learned pursuit process allows to incorporate the parsimonious representation 
into higher-level optimization problems in a natural way. In particular, in Sec- 
tion [5] we show a very simple and efficient extension of the proposed RPCA 
framework to cases where the data undergo an unknown transformation that is 
sought for during the pursuit [S^.. We also show the construction of discrimi- 
native parsimonious models. 

Finally, in Secion|6]we demonstrate our approaches on applications in im- 
age classification, face modeling, signal separation and denoising, and speaker 
identification, where our fast encoders perform similarly to or better than the 
iterative pursuit processes at a fraction of the complexity of the latter. Faster 
than real-time state-of-the-art results are achieved in several such applications. 
The present paper generalizes and gives a more rigorous treatment to results 
previously published by the authors in [MJ [3S] . 

2 Parsimonious models 

Let X £ be a give data matrix. In this work, we concentrate our attention 

on the general parsimonious modeling problem that can be posed as the solution 
of the minimization problem 

min^|lX-DZ|l^-H^(Z) + 0(D), (1) 

optimized over Z e M?^" alone or jointly with D e M"^^''. Here Z e M^^" is 
the representation (parsimonious code) of the data in the dictionary, and the 
penalty terms ?A(Z) and (/'(D) induce a certain structure of the code and the 
dictionary, respectively. When the minimization is performed over both the 
dictionary and the representation, it is non-convex. 

We will explicitly distinguish between parsimonious coding or representation 
pursuit problems (representing data with a given model), and the harder par- 
simonious modeling problem (constructing a model describing given data, e.g., 
learning a dictionary). In the former, D is fixed and (/'(D) is constant. Most 
useful formulations use convex regularization ip{Zi) of the representation. 

In many relevant applications, the entire data matrix X is not available 
a priori. The data samples {xt}fgN, xt G M™, arrive sequentially; the index t 
should be interpreted as time. Online parsimonious modeling aims at estimating 
and refining the model as the data come in [36J. The need for online schemes 
also arises when the available training data are simply too large to be handled 
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together. When the regularized function ip is vector-wise separable, 

n 

^PiZ) = ^^(z,), (2) 

i=l 

problem ([!]) can solved in an online fashion using an alternating minimization 
scheme. As a new data vector xt is received, we first obtain its representation 
Zt given the current model estimate, Dt-i- This is achieved by solving the 
representation pursuit problem 

1 2 
Zt = argmin- ||x4 - D4_iz||2 + lACz)- (3) 

z ^ 

Then, we update the model using the coefficients, {zj}j<t, computed during the 
previous steps of the algorithm, 

' 1 

Dt = argminV/3,-||x, -Dzj|l^ + 0(D), (4) 

where E [0,1] is a forgetting factor that can be added to rescale older infor- 
mation so that newer estimates have more weight. This can be efficiently solved 
without remembering all the past codes [36] . 

In what follows, we detail several important instances of Q, for both mod- 
eling or pursuit, and how they can be cast as online learning problems. 

2.1 Structured sparsity 

The underlying assumption of sparse models is that the input vectors can be 
reconstructed accurately as a linear combination of the dictionary atoms with 
a small number of non-zero coefficients. Sparse models are enforced by using 
sparsity-promoting regularizers ipCZ). The simplest choice of such a regularizer 
is "0(2) = ^J2'i=i l|zi||i (with A > 0), for which the pursuit problem can be split 
into n independent problems on the columns of Z, 

1 2 

min - llx — DzIL + Allzlli. (5) 

zeR<! 2 

This is the classical unstructured sparse coding problem, often referred to as 
Lasso [2] or basis pursuit [Tl. 

Structured sparse models further assume that the pattern of the non-zero 
coefficients of Z exhibits a specific structure known a priori. Let A ^ {1, . . . ,q} 
denote groups of indices of atoms. Then, we define a group structure, Q, as a 
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collection of groups of atoms, G = {Ai, . . . , The regularizer corresponding 

to the group structure is defined as the column-wise sum 

IS I 



and Zj. denotes the subvector of z corresponding to the group of atoms Ar- The 
regularizer function tp in the Lasso problem Q arises from the special case of 
singleton groups G = {{!}, {2}, . . . , {q}} and setting — A. As such, the effect 
of ipg on the groups of z is a natural generalization of the one obtained with 
unstructured sparse coding: it "turns on" and "off" atoms in groups according 
to the structure imposed by G- 

Several important structured sparsity settings can be cast as particular cases 
of (|6| : Group sparse coding, a generalization of the standard sparse coding to 
the cases in which the dictionary is sub-divided into groups [5], in this case G 
is a partition of {1, . . . , g}; Hierarchical sparse coding, assuming a hierarchical 
structure of the non-zero coefficients [SI [3 H] • The groups in G form a hierarchy 
with respect to the inclusion relation (a tree structure), that is, if two groups 
overlap, then one is completely included in the other one; Overlapping group 
sparse coding, relaxing the hierarchy assumption so that groups of atoms Ar are 
allowed to overlap. This model was found to be successful in modeling gene ex- 
pression and other genetic data. Note that in all the above cases, the structure 
is repeated across the columns of Z; consequently, the structured sparse coding 
problem ([T]) can be split into n independent problems operating on the columns 
of X and Z. One possible way of extending sparse models is by imposing struc- 
ture on sub-matrices of Z. Collaborative sparse coding generalizes the concept 
of structured sparse coding to collections of input vectors by promoting given 
patterns of non-zero elements in the coefficient matrix [3 [37] . 

2.2 Low-rank models and robust PCA 

Another significant manifestation of parsimony typical to many classes of data is 
low rank. The classical low rank model is principal component analysis (PCA), 
in which the data matrix X S M™^" (each column of X is an m-dimensional 
data vector), is decomposed into X = L + E, where L is a low rank matrix and 
E is a perturbation matrix. 

PCA is known to produce very good results when the perturbation is small 
[55] . However, its performance is highly sensitive to the presence of samples 
not following the model; even a single outlier in the data matrix X can render 




where 
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the estimation of the low rank component arbitrarily far from the true matrix 
L. [131 El proposed to robustify the model by adding a new term to the 
decomposition to account for the presence of outliers, X = L + O + E, where O 
is an outlier matrix with a sparse number of non-zero coefficients of arbitrarily 
large magnitude. In one of its formulations, the robust principal component 
analysis can be pursued by solving the convex program 

^ jnin ^ ||X - L - 0||2 + A,||L|U + A||0||i. (7) 

The same way the £i norm is the convex surrogate of the Iq norm (i.e., the 
convex norm closest to £o), the nuclear norm, denoted as || • ||*, is the convex 
surrogate of matrix rank. The parameter A* controls the tradeoff between the 
data fitting error and the rank of the approximation. 

In [11] it was shown that the nuclear norm of a matrix of L can be reformu- 
lated as a penalty over all possible factorizations 

||L|l,=mini||A||^ + i||B||^ s.t. AB = L, (8) 

The minimum is achieved through the SVD of L = USV^: the mimmizer 
of ([8| is A = US 2 and B = S^V. This factorization has been recently 
exploited in parallel processing across multiple processors to produce state-of- 
the-art algorithms for matrix completion problems , as well as an alternative 
approach to robustifying PCA in 

In ([t]), neither the rank of L nor the level of sparsity in O are assumed 
known a priori. However, in many applications, it is a reasonable to have a 
rough upper bound of the rank, say rank(L) < q. Combining this with ([8|, it 
was proposed in [3S] to reformulate Q as 

min I IIX - DoS - 0|1^ + ^(||Do||^ + ||S|||) + A |10|1, , (9) 

with Do e M"""^, S e M9^", and O e M™""". This new factorized formulation 
reduces the number of optimization variables and reveals much structure hidden 
in the problem. The low rank component can now be thought of as an under- 
complete dictionary Dq, with q atoms, multiplied by a matrix S containing in 
its columns the corresponding coefficients for each data vector in X. This inter- 
pretation allows to write problem ([t]) in the form of our general parsimonious 
models with Z = (S; O), i^(Z) = ^ ||S||p + A ||0||,, D = (Do, I^xm), and 
0(D) — ^ IjDollp. Furthermore, unlike the nuclear norm, this new formulation 
of the rank-reducing regularization is differentiable and vector-wise separable 
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(well suited for the online setting ([3])). However, problem ([9]) is no longer con- 
vex. Fortunately, it can be shown that any stationary point of {Dg, S, O}, 
satisfying ||X — DqS — 0||2 < A* is an globally optimal solution of ^ j40] . 
Thus, problem ^ can be solved using an alternating minimization, as in our 
online setting, without the risk of falling into a stationary point that is not 
globally optimal. 



2.3 Non-negative matrix factorization 

Another popular low rank model is non- negative matrix factorization (NMF). 
Given a non- negative data matrix X S K™^", NMF aims at finding a factoriza- 
tion X DZ into non-negative matrices D e E™^'^ and Z e E''^", with q < n. 
The factorization is obtained by solving the highly non-convex problem 

min ||X-DZ||p. (10) 
D,z>o" ^ ^ 

Problem ( 10 1 can be stated as particular instance of ([!]) by setting -0 and to 
be the sum of element-wise indicator functions of the form 

^ {" -'ll (U) 

\ oo : t < 0. 

The non-negativity constrain has been shown to be crucial for learning a part 
representation of the data, making it particularly attractive in the problem of 
source separation. An extensive amount of work reported in the literature has 



been devoted to regularizing ( 10 ) in meaningful ways. 

Similarly to PCA, NMF is sensitive to outliers in the data matrix. A robust 
variant can be obtained by adding a sparse outlier term to the decomposition, 
X w DgS-l-O, as done in the RPCA model [12] • Again here the problem can be 
cast as a particular instance of ([T]) defining Z = (S; O) and D = (Do,Imxm)- 

NMF is by construction a low rank representation, since the number of atoms 
in Do is normally chosen to be significantly smaller than the dimension of the 
data. This means that an upper bound of the rank of the approximation needs 
to be known beforehand. NMF is known to be very sensitive to this parameter, 
due to the natural compromise between richness of the model and over-fitting. 
In most practical settings, q is carefully chosen based on empirical evidence. In 
[35j . we proposed a cure to this phenomenon by incorporating a rank- reducing 



term into (10), establishing in this way a link between NMF and the RPCA 
problem ([9]). Combined with the outlier term, we can formulate a robust low- 
rank NMF problem 

min ||X-DoS-0||p + A, ||DoS|L +A||0|L . (12) 

Do,S,0>0 ^ ^ 
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The resemblance of ( 12 ) to the RPCA problem tempts to apply the reasoning 



we used before to get rid of the nuclear norm, adding non- negativity constraints 
to 



mm - 

Do,S,0>0 2 



IX-DnS-OI 



|lDo||^ + |lS|l^) + A|10|l,. (13) 



However, unlike the RPCA case, problems ( 12 ) and ( 13 1 are not equivalent as 



the minimum of ([8]) is not necessarily attained by non-negative factors. In fact, 
adding non-negativity constraints to ([s]) produces 



|DoS| 



< 



< 



- min |||A 
2 A,B>0 ^" 



|B||^ s.t. AB = DqS} 



D 



OIIf 



(14) 



Thus, the sum of the Frobenius norms of the non-negative matrices D and 
S gives an upper bound on the nuclear norm of their product. While not 



being fully equivalent to (12), the objective in problem (13) still achieves both 



robustness to outliers and rank regularization. With some abuse of terminology. 



we will refer to problem (131 as to RNMF. Again here, (13) is a particular 



instance of ([T]) well suited for the online setting. 



3 Proximal methods 

Proximal splitting is a powerful optimization technique allowing to efficiently 
solve a variety of optimization problems, such as non-smooth convex programs. 
They have been adopted by the machine learning and signal processing com- 
munities for their simplicity, convergence guarantees, and the fact that they are 
well suited for tackling sparse and structured sparse coding problems that can 
be written as ([!]) (refer to [22] for recent reviews). In Section |4] we will use these 
algorithms to construct efficient learnable pursuit processes. 

Proximal splitting methods are designed for solving optimization problems 
in which the cost function can be split into the sum of two terms, one convex 
and differentiable with an a-Lipschitz continuous gradient, and another convex 
extended real valued and possibly non-smooth. Clearly, pursuit problems ^ fall 
into this category: the convex quadratic data fitting term has a linear gradient 
D""" (Dz — x) with the Lipschitz constant given by the squared spectral norm of 
the dictionary, a — ||D||^, and the regularizer tp is typically convex and non- 
smooth. The proximal splitting method with fixed constant step defines a series 



10 



input : Data x, dictionary D, weights A. 
output: Sparse code z. 

Define H = I - iD^D, W = iD^, t = ^A. 

Initialize z° = and b° = Wx. 

for k — 1,2,... until convergence do 

z^+i =7rt(b'=) 

b'^+i =b'= + H(z^-+i-z'=) 
end 

Algorithm 1: Iterative shrinkage-thresholding algorithm (ISTA). 
of iterates, {z'^j^gN, 

z'^+i=7r„,4z'=--^DT(Dz'=-x)), (15) 

where 

TTa4,{z) = argmin ||u - z||2 + ai/'(u) (16) 

denotes the proximal operator of ip. Fixed-step algorithms have been shown to 
have relatively slow sub-linear convergence, and many alternatives have been 
studied in the literature to improve the convergence rate [T51 HO] ■ Accelerated 
versions of the fixed-step algorithm can be used to reach linear convergence rates 
(the best possible for the class of first order methods) . The discussion of theses 
methods is beyond of the scope of this paper. 

Proximal splitting methods become particularly interesting when the prox- 
imal operator of ^p can be computed exactly and efficiently. Many important 
cases of structured sparsity fall into this category. For the simple unstructured 
sparsity models induced by regularizers of the form 

m 

^P{z) = |lA0z||^ ^^X,z,, 
i=i 

with denoting element-wise multiplication, the proximal operator reduces 
to the element-wise scalar soft-thresholding operator, (TTx{z))i — T\.{zi), with 
T\{t) ~ sign(i) max{0, \t\ — A}. In this case, the fixed step proximal splitting 
algorithm corresponds to the popular iterative shrinkage-thresholding algorithm 
(ISTA) [171 [TBI summarized in Algorithm [l] Note that the matrices H and W 
in Algorithm [T] are derived from the linear gradient of the data term. 

If z is furthermore constrained to be non-negative, as the sparse outlier in the 
RNMF problem, the soft thresholding is replaced by its one-sided counterpart 
r^(t) = max{0,t — A}. The proximal operator of the indicator function i^(t) 
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input : Data x, dictionary Dq, weights A, A* 
output: Approximation 1, outlier o. 

DefineH = I-i + , |,W=1| Land 





1' 









Do (l + A,)! 

Initialize z° = 0, b" = Wx. 
for k — 1,2,... until convergence do 
z'-^+i -7rt(b^) 
b'^+i ^b'- + H(z^-+i-z'=) 
end 

Split z'^^^ = (s; o) and output 1 — Dqs. 
Algorithm 2: Proximal descent algorithm for the online RPCA and RNMF 

problems with fixed dictionary Dq. The distinction between the two models 

is obtained through the selection of the proximal operator: 7rt(b) = Tt(b) 

(RPCA) and 7rt(b) = T^(h) (RNMF). 



imposing the non-negativity constraint is simply TQ{t) = max{0,i}. The fixed- 
step proximal descent algorithm for the online RPCA and RNMF problems is 
summarized in Algorithm [2j 

To generalize the proximal operator to the structured sparsity case, let us 
first consider an individual term in the sum (|6]), ipri^) =^ \\^r\\2- Its proximal 
operator, henceforth denoted as tt\^, can be computed as, 

r , \\ I II, II '''a^ (I I ^■'■11 2) • s = r, 
(7rA,(z)), = { iKh (17) 

where sub-indices r and s denote a sub-vector of the g,.-dimensional vector 
specified by the group of atoms A^. 

Note that tt\^ applies a group soft thresholding to the coefficients belong- 
ing to the r-th group and leaves the remaining ones unaffected. For a non- 
overlapping collection of groups the proximal operator oftpg is group-separable 
and can be computed independently for each group as 

7rx(z) = ((7rAi(z))i;-- - ;(7rA|g|(z))|g|), 

where A = (Ai, . . . , A|g|)"'" is the vector with the threshold parameters A^. 

In general, when the groups of G overlap, there is no efficient way of com- 
puting the proximal operator of ^g. An important exception to this is the 
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hierarchical setting with tree-structured groups. Let us be given a tree hierar- 
chy of groups G = Gi ^ ■ ■ ■ ^ Gl with each Gi aggregating r; non-overlapping 
groups corresponding to the l-th level of the tree. Then, the proximal operator 
of tjjg can be shown to be given by the composition of the proximal operators of 
tpg^ in ascending order from the leaves to the root [HI [Hj i = tt^^^ o • • • o tt^^ . 
Here A; e M''' denotes the sets of weights corresponding to the constituent group 
of each level. A particular case of the tree-structured hierarchical sparse model 
is the two-level HiLasso model introduced to simultaneously promote sparsity 
at both group and coefficient level [HI EI] . Algorithm [l] is straightforward to 
generalize to the case of hierarchical sparsity by using the appropriate proximal 
operator. 

It is worthwhile noting that the update in Algorithm [l] can be applied to a 
single element (or group in case of structured sparsity) at a time in a (block) 
coordinate manner. Several variants of coordinate descent (CoD) and block- 
coordinate descent (BCoD) proximal methods have been proposed [12 [TO] . 
Typically, one proceeds as in Algorithm [l] first applying the proximal oper- 
ator y = 7r(b'^). Next, the residual e = y — z'^ is evaluated, and the group is 
selected e.g. according to r = argmax^ ||er||2 (in case of unstructured sparsity, 
r — argmaxr |ej.|). Then, b*"^^ is computed by applying H only to the selected 
subgroup of e, and z'^^^ is computed by replacing the subgroup of z'^' with the 
corresponding subgroup of y. 



4 Learnable pursuit processes 

The general parsimonious modeling problem ([T]) can be alternatively viewed as 
the minimization problem 

1 " 

a;:K9-i-R" 

with L{x,z,x) = i II (id — a; o 2;) (x) lip -f- ip{z{x.)) + 4>{x). The optimization 
is now performed over an encoder z — z{yi) mapping the data vector x to 
the representation vector z, and a decoder x = x{z,) performing the converse 
mapping. The encoder/decoder pair is sought to make the composition x o z 
close to the identity map, under the regularity constraints promoted by the 
penalties tp and (j). 

Existing parsimonious models restrict the decoder to the class of linear func- 
tions a;D(z) — Dz parametrized by the dictionary D. For a fixed dictionary D, 
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the optimal encoder is given by 

z* = are min L(x, z,xr>), (19) 

which is nothing but the solution of the representation pursuit problem obtained 
through and iterative optimization algorithm, such as the proximal methods 
described in Section [3] This interpretation is possible since the solution of the 
pursuit problem implicitly defines a deterministic mapping that assigns to each 
input vector x S M" a unique parsimonious code z e M'". Naturally, this 
mapping cannot be stated explicitly. 

In contrast, the process-centric approach proposed in this work, aims at 
formulating a modeling scheme were both the encoder and the decoder can 
be explicitly stated and efficiently computed. In our proposed framework, the 
encoders are constructed explicitly as parametric deterministic functions, zq : 
M" M™ with a set of parameters collectively denoted as 0, while the decoders 
are the exact same simple linear decoders, x-d(z) = Dz, used in model-centric 
approaches (we relax this assumption in the following section) . We denote by JF 
the family of the parametric functions 2;©. Naturally, two fundamental question 
arise: how to select the family J-' capable of defining good parsimonious models, 
and how to efficiently select the best parameters given a specific family. We 
will refer to the first problem as to selecting the architecture of the pursuit 
process, while the second will be referred to as process learning. We start with 
the latter, deferring the architecture selection to Section [43l 



4.1 Process learning 



With the process-centric perspective in mind, problem ( 18 ) can be stated as 

1 " 

min -y^L(xi,2;©,a;D), (20) 

i—l 

where the family T imposes some desired characteristics on the encoder such as 
continuity and almost everywhere differentiability, and certain computational 
complexity. As in ([T]) , this problem can be naturally solved using an alternating 
minimization scheme, sequentially minimizing for z& or D while leaving the 
other one fixed. Note that when the encoder z@ is fixed, the problem in D 
remains essentially the same dictionary update problem and can be solved ex- 
actly as before. In what follows, we therefore concentrate on solving the process 
learning problem 



mm 
z&eT n 

i—i 



1 " 

-Vi(x„;20), (21) 

77 r J 
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given a fixed dictionary. To simplify notation, we henceforth omit ccd from L 
whenever D is fixed. 



Observe that problem (21) attempts to find a process z© in the family F 
that minimizes the empirical risk over a finite set of training examples, as an 
approximation to the expected risk 

1 " f 

i—l 

over the data distribution P. While the empirical risk measures the encoder per- 
formance over the training set, the expected risk measures the expected perfor- 
mance over new data samples following the same distribution, that is, the gener- 
alization capabilities of the model. When the family J- is sufficiently restrictive, 
the statistical learning theory justifies minimizing the empirical risk instead of 
the expected risk[43j. We will come back to this issue in Section [X2| where we 
address the accuracy of the proposed encoders in approximating L(2;@). 

When the functions belonging to T are almost everywhere differentiable with 
respect to the parameters 0, stochastic gradient descent (SGD) can be used to 



optimize (21), with almost sure convergence to a stationary point 04]. At each 
iteration, a random subset of the training data, {xj^, . . . ,Xi^}, is selected and 
used to produce an estimate of the (sub)-gradient of the objective function. 
Specifically, in our case the parameters of z© are updated as 

k=l 

where fj, is a decaying step, repeating the process until convergence. This re- 
quires the computation of the (sub)-gradients dL/d&, which is achieved by 
a back-propagation procedure as detailed in the sequel. SGD algorithms scale 
well to big data applications where the limiting factor is the computational time 
rather than the number of available samples. 

4.2 Approximation accuracy 

Following |44j . we split the process training approximation error into three 
terms, e = eapp -l- Ecst + fopt- The approximation error Capp = ]E{L(2;@) — 
L{z*)} measures how well the optimal unrestricted pursuit process z* given 



by ( 19 1 is approximated by the optimal pursuit process restricted to J-, z^ = 
argmin^ggjF- 'L{z@). The estimation error Cost = E{L(zq) — L{zq)} with i© = 
arg min^ggjF- L(2;©) measures the cost of optimizing the empirical risk instead 
of the expected risk. Finally, the optimization error eopt ~ E{L(2;©) — L(i:©)} 
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measures the effect of having z@ that minimizes the empirical risk only approx- 
imately. 

The estimation error vanishes asymptotically with the increase of the train- 
ing set size n. The optimization error can be made negligible (at least, in 
the offline setting) by simply increasing the number of SGD iterations. Conse- 
quently, the quality of the learned pursuit process largely depends on the choice 
of the family which we will address in the next section. 

4.3 Process architecture 

We extend the ideas introduced in |30| to derive families of trainable pur- 
suit processes from proximal methods. Let us examine a generic fixed-step 
proximal descent algorithm described in Section [3j Each iteration can be de- 
scribed as a function receiving the current state (bi„,Zi„) and producing the 
next state (bout,Zout) by applying the non-linear transformation Zout ~ 7rt(bin) 
(representing the proximal operator), and the linear transformation bout = 
bin + H(zout — Zin) (representing the linear part of the gradient). This can 
be described by the function (bout,Zout) — /h 1(^111, Zin) parametrized by the 
matrix H describing the linear transformation, and the vector t describing the 
parameters of the proximal operator TTt . 

A generic fixed-step proximal descent algorithm can therefore be expressed 
as a long concatenation of such iterations, 2;* (x) = • • • o fj^ ^ o • • • o /jj.t(Wx, 0), 
with the initialization (bi,i,Zin) = (Wx, 0). For example, Algorithms [l] and [2] 
follow this structure exactly, for the appropriate choice of the parameters. Block- 
coordinate proximal descent algorithms operate very similarly except that the 
result of the application of / is substituted to a subset of the elements of the 
state vector, selected in a state-dependent manner at each iteration. 

Following [30] ■ we consider the family of pursuit processes derived from 
truncated proximal descent algorithms with T iterations, J-t — {zt,&{'x.) = 
fut ° • ■ ■ ° /Ht(W^iO)}- convenience, we collected all the process pa- 
rameters into a single pseudo-vector = {W, H,t}. Also note that at the 
last iteration, only z of the state vector (b, z) is retained; with some abuse of 
notation, we still denote the last iteration by /. Finally, we denote by J-'oo the 
family of untruncated processes. 

A process zt.& G J^t can be thought of as a feed-forward neural network 
with identical layers / jj ^ . Flow diagrams of processes derived from the prox- 
imal descent Algorithms [l] and [2] for the Lasso and RPCA/RNMF problems 
are depicted in Figure IT] Since the processes are almost everywhere with 
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Figure 1: Encoder/decoder architectures for the unstructured Lasso (top), and 
the robust PCA/NMF (bottom). 



respect to the input x and the parameters 0, it is possible to calculate the 
sub-gradients required for the optimization algorithms. The computation of the 
sub-gradients of L(x, 2;©(x)) with respect to is carried out by an iterated 
application of the chain rule starting at the output and propagating backward 
into the network. The procedure, frequently referred to as back-propagation, is 
detailed in Algorithm [3j where following the standard notation from the neural 
network literature, the b prefix denotes the gradient of L with respect to the 
variable following it, (5* = dL/d*. 



4.4 Approximation error vs. complexity trade-off 

Since the objective minimized via proximal descent is convex, it is guaranteed 
that there exists some selection of the parameters 0* such that limT_5.oo zt,®* = 
z*. In other words, the optimal process z* is contained in J^^o- Furthermore, re- 
formulating the non-asymptotic convergence analysis from |18j in our language, 
the following holds: 

Theorem 1 (Beck&TebouUe). For every D, there exists 0* and C > such 

C 

that for every z and T > 1, £(x, zt.&*) — L{x, z*) < ^||z*(x)||2. 

This result is worst-case, in the sense that it holds for every input vector x. 
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input : Sub-gradient (5z = — . 

oz 

output: Sub-gradients of L with respect to the parameters, 5H, JW, ^t; 
and with respect to the input, 5x. 

Initiahze ,5t^ = ^""^Sz, 6h'^ = ^^""^Sz, = 0, and Sz^ = 
at ab 

for fc = T-l,r-2,...,l do 

(5H*^-i = Sn'' + (5b'=(z'^+i - z'^f 

(5t'=-i = ^t'^ + ^!^(HT<5b'= - Sz") 
ot 

^b-^=^b^ + ^HW + ^^z^ 
at ab 

end 

Output dU = (5H°, 6t = 6t°, 6W = Sh'^x^, and (5x = W^Sh'' 
Algorithm 3: Computation of the sub-gradients of L(x, zt,0(x)) for a pur- 
suit process zt,& & Tt- The reader is warned not to confuse the T-th iteration 
index, z^, with the transpose, z^ . 



Assuming bounded support of the input distribution and taking expectation 
with respect to it, we obtain the following: 

Corollary 1. There exist 0* and C > such that for T > 1, 

eapp=E{i(zT,0O-i(^*)}< ^- 

In other words, the family J^t of pursuit processes allows to set the approxima- 
tion error to an arbitrarily small number. 

One may wonder whether the iter we have undergone so far is of any value 
at all, given that the process learning approach is only capable of approximating 
the optimal pursuit process achieved via an iterative algorithm. This picture 
totally changes, however, when we consider the trade-off between the approx- 
imation error, eapp, and the computational complexity of the encoder, which 
is proportional to T. The bound in Theorem 1 is uniform for every input x, 
independently of the input distribution P, while in practice we would strive 
for a much faster decrease of the error on more probable inputs. This implies 
that the bound in Corollary 1 is by no means tight, and there might be other 
selections of giving much lower approximation error at a fixed T. While 
pursuit processes optimal in the sense of the expected risk can be found using 
the process learning approach, there is no simple way for an iterative pursuit 
algorithm to take the input data distribution into account. 
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Figure 2: Comparison between RNMF encoders and proximal methods for 
music source separation. Optimality gap as a function of the number of itera- 
tions/layers with encoders trained in the unsupervised regime (left); and the li 
separation error obtained with supervised and the unsupervised training (bot- 
tom) . The red dotted line represents the error achieved by Algorithm [2] after 
convergence. 



An illustration to the advantage of the trained pursuit process is shown in 
Figure [2] (left), in which the performance of RNMF encoders is compared to 
that of the corresponding proximal algorithms. As the example, we used the 
audio separation problem described in further details in sections 4.4 and [6] The 
figure shows the optimality gap \j{zt,&) — L(2;*) as a function of T for the 
truncated proximal descent (0 — 8° set as prescribed by Algorithm [2| and 
for the trained encoder (8 ~ 8*). This optimality gap can be thought of as 
an empirical approximation error, Capp, for the corresponding spaces Tt- It 
takes about 70 iterations of the proximal method to reach the error obtained by 
a 7— layer encoder. A further and stronger justification to the process-centric 
approach advocated in this paper is presented in the next section. 



5 Training regimes 

As it was described in the previous section, parsimonious modeling can be inter- 
preted as training an encoder/decoder pair (z^x) that minimizes the empirical 



loss L in problem (18). We refer to this setting as unsupervised, as no informa- 
tion beside the data samples themselves is provided at training. An important 
characteristic of this training regime is that it can be performed online, com- 
bining online adaptation of with online dictionary update as described in 
Section m 
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In many applications, one would like to train the parsimonious model com- 
prising the encoder/decoder pair to optimally perform a specific task, e.g., 
source separation, object recognition, or classification. This setting can be still 



viewed as the solution of the modeling problem (18), with the modification of 
the training objective L to include the task-specific knowledge. However, for 
most objectives, given a fixed linear decoder a;D(z) — Dz, the encoder of the 
form 

1 2 

2;(x) = arg min - ||x — Dz||p + -(/^(z) (23) 



is no more optimal, in the sense that it is not the solution of ( 18 ) with the 



fixed decoder. Moreover, the inclusion of an encoder of the form ( 23 ) into the 



modeling problem ( 18 1 gives rise to the hard bi-level optimization problem 



1 1 

min L{xi,Zi,x-D) s.t. Zj = arg min - ||xi - Dz||p + ■0(z). (24) 

DeK'nx'j n ^ — ' zeR9 2 

2—1 

Trainable pursuit processes provide an advantageous alternative in these 
cases. First, by being an explicit deterministic function, they remove the need to 
solve the bi-level optimization problem. Second, they have fixed complexity and 
latency controllable through the parameter T, and they are expected to achieve 
a better complexity-approximation error trade-off than the iterative pursuit 
processes. Third, trainable processes constitute a good compromise between 
the two extremes: the iterative pursuit having no elements of learning at all, 
and the general regression problems, fully relying on learning. The described 
trainable processes have the structure of the iterative pursuit built into the 
architecture on one hand, while leaving tunable degrees of freedom that can 
improve their modeling capabilities on the other. Furthermore, such processes 
come with a very good initialization of the parameters, which is non-trivial in 
the more general learning scenarios. 

In the sequel, we exemplify various training regimes (i.e., different objectives 



in ( 18 1) that can be used in combination with the architectures described in Sec- 
tion |4.3[ leading to a comprehensive framework of process-centric parsimonious 
models. 



5.1 Supervised learning 

In many applications, parsimonious models are used as a first order approxima- 
tion to various classes of natural signals. An example is the music separation 
problem discussed in Section [^4) While achieving excellent separation results. 
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both the RPCA and RNMF models are merely a crude representation of the 
reality: the music is never exactly low-rank, as well as the voice does not exactly 
admit the naive unstructured sparse model. We propose to fill the gap between 
the parsimonious models and the more sophisticated (and, hence, prone to er- 
rors) domain-specific models, by incorporating learning. This can be done by 



designing the objective function L of ( 18 ) to incorporate the domain-specific 



information. Figure [2] (bottom) shows that encoders trained this way outper- 
form not only their unsupervisedly trained counterparts, but also the exact 
RPCA/RNMF algorithm. 

Sticking to the example of audio source separation by means of online robust 
PC A or NMF, let us be given a collection of n data vector, x^, representing the 
short-time spectrum of the mixtures, for which the clean ground-truth accompa- 
niment 1* and voice o* spectra are given. We aim at finding an RPCA/RNMF 
encoder (s;o) — 2;©(x) with the architecture depicted in Figure [I] and the lin- 
ear decoder (1, o) = D(s;o) parametrized by D = (Do,Imxm), that minimize 



( 18 ) with the objective 

1 " 

L(z©, D) = - V ||Dos©(x,) - l*\\l + ||o©(xi) - o, 

n ^ — ^ 



* l|2 



n 

i=l 



+ y (IIDoll^ + ||50(x.)|||) + A||o©(x,)||i. (25) 

Note that the architecture and the initial parameters of the encoder are already 
a very good starting point, yet the supervised training allows it to better model 
the reality that is not fully captured by the RPCA/RNMF model (Figure [2) 
bottom). 

In a broader perspective, the sound separation example can be viewed as a 
particular instance of supervised encoder/decoder training regimes, in which the 
loss L is expressed in terms of the composite output of xo z, but is supervised 
by some different from the input data, e.g., 

1 " 

i{z, x) = -Y, \\y^ - X o z{^i)\\l + i^{z[^^)) + c^{x). (26) 

1=1 

Compare this to the unsupervised setting, where essentially = x^, and the 
performance of the encoder/decoder pair is measured by their ability to recon- 
struct the input. 

The idea of |30) to train pursuit processes to approximate the output of 
iterative pursuit algorithms falls into this category. For each training data 



vector Xi, let z* — 2*(xi) be the output of the optimal encoder (19). Setting 
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the decoder to the identity map, x = id in (26) reduces the supervision to the 
encoder outputs, 

n 

i{z) = -Y,\\z,-z{^,)\\l (27) 

1=1 

While producing good results in practice, under the limiting factor that the 
testing data is from the same class as the training data, this training regime 
requires supervision yet it is unable to improve the modeling capabilities of the 
encoder beyond those of the underlying parsimonious model (unlike the super- 
vision of the decoder outputs in (26 1). On the other hand, we show in Section |6] 



that similar performance can be achieved by using unsupervised training. We 
refer to this regime as Approximation. 

5.2 Discriminative learning 

The supervised training setting is also useful to extend parsimonious models 
beyond the conventional generative scenario, in which the data can be approx- 
imately recovered from the representation, to the discriminative scenario, such 
as classification problems, where the representation is typically non-invertible 
and is intended to capture various invariant properties of the data. 

As an illustration, we use the simultaneous speech denoising and speaker 
identification model from in which the spectrogram of the input signal is 
decomposed into X w DqS + DO, where DqS capturing the noise is required 
to be low-rank, while the activation O representing the speech is required to 
be sparse. A collection of speaker-specific dictionaries, Di, . . . ,Dfc, is trained 
offiine, and the models are fit to previously unobserved data. The lowest fit- 
ting error is then used to assign the speaker identity. See Section |6.4.2| for 
experimental evaluation with real data. 

Using the process-centric methodology, we construct k encoders {sq. , o©^. )(x), 
and k corresponding decoders (DqS, Djo) with the shared noise dictionary Do- 
The encoder/decoder pairs are trained by minimizing an empirical loss of the 
form 

L(x,Z,ei,...,efc,Do,...,Dfe) = |lx-DoS0,(x)-D,O0,(x)|l2 (28) 
+ max{0,e- ||x - Dqs©^. (x) - DjOe^(x)||^} , 

averaged on the training set containing examples of noisy speech, x, and the 
corresponding labels, I € {1, . . . , fc} indicating which speaker is present in the 



sample. For each training sample, the loss function (28 1 promotes low fitting 



error for the encoder from the class coinciding with the ground-truth class, while 



22 



favoring high fitting error for the rest of the encoders. The hinge parameter e 
counters excessive influence of the negatives. 



In Section 6.4.2 we show empirical evidence that encoder/decoder pairs 
trained using a discriminative loss perform the classification task better than 
those trained to produce the best reconstruction. 

5.3 Data transformations 

Parsimonious models rely on the assumption that the input data vectors are 
"aligned" with respect to each other. This assumption might be violated in 
many apphcations. A representative example is face modeling via RPCA, where 
the low dimensional model only holds if the facial images are pixel-wise aligned 
[33j . Even small misalignments can break the structure in the data; the repre- 
sentation then quickly degrades as the rank of the low dimensional component 
increases and the matrix of outliers loses its sparsity. In [33] , it was proposed 
to simultaneously align the input vectors and solve RPCA by including the 
transformation parameters into the optimization variables. This problem is 
highly non-convex, yet if a good initialization of the transformation parameters 
is available, a solution can be found by solving a sequence of convex optimization 
problems, each of them being comparable to ([t]). 

Following |33j . we propose to incorporate the optimization over geometric 
transformations of the input data into our modeling framework. We assume that 
the data are known to be subject to a certain class of parametric transformations 
T = {Sq, : M™ — > M™ : a.}. Given a data matrix X, we search for its best 
alignment together with finding the best model by solving 



mm 

ai,...,otn 



1 " 

-y'^(fl'a,(xi),2;©,a;D), (29) 



jointly optimizing over the encoder zq, possibly the decoder xd, and the trans- 
formation parameters cXi of each of the training vectors. This approach can be 
used with any of the training objectives described before. 



The optimization problem ( 29 ) can be carried out as a simple extension of our 
general process training scheme. Transformation parameters ct can be treated 
on par with the encoder parameters 0; note that since the encoder functions 
z@ G J- are by construction almost everywhere differentiable with respect to 
their input, we can find the sub-gradients of the composition z@{g^{x)) with 



respect to a by simply applying the chain rule. This allows to solve (291 using 
SGD and the back-propagation in Algorithm [3j 
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The obtained encoders are conceptually very similar to the ones we had 
before and can still be trained in an online manner. In this particular setting, 
our framework presents a significant advantage over existing approaches based 
on iterative pursuit. In general, as the new data arrive and the model is refined 
or adapted, the alignment of the previously received input vectors has to be 
adjusted. In our settings, there is no need to recompute the alignment for the 
whole dataset from scratch, and it can be adjusted via online SGD. 

Unlike the untransformed model, the pursuit is no more given by an explicit 
function 2;®, but requires the solution of a simple optimization problem in 
OL, OL* = argminL(x, 2;© og^,x^). A new data vector x is then encoded as 

OL 

6 Experimental results 

In what follows, we describe experiments conducted to assess the efficiency of 
the proposed framework. Fast encoders were implemented in Matlab with built- 
in GPU acceleration and executed on Intel Xeon E5620 CPU and NVIDIA Tesla 
C2070 GPU. When referring to a particular encoder, we specify its architecture 
and the training regime; for example "CoD (Unsupervised)" stands for the CoD 
network trained in the unsupervised regime. We denote by Approximation, 
Supervised, Unsupervised and Discriminative the corresponding training regime 
as defined in Section [5j Untrained denotes an untrained encoder, that is, with 
parameters set as in the corresponding fixed-step proximal descent algorithm; 
the performance of such an encoder coincides with that of a truncated proximal 
descent. 

6.1 Online sparse encoders 

To evaluate the performance of the unstructured CoD [Unsupervised) encoders 
in the online learning regime, we used 30 x 10** randomly located 8x8 patches 
from three images from the Brodatz texture dataset [33]. The patches were 
ordered in three consecutive blocks of 10'' patches from each image. Dictio- 
nary size was fixed to g = 64 atoms, and T = 4 layers were used in all CoD 
encoders. Unsupervised online learning was performed using the Lasso objec- 
tive with A = 1 on overlapping windows of 1,000 vectors with a step of 100 
vectors. Online trained encoders were compared to an online version of Lasso 
with the dictionary adapted on the same data. As reference, we trained offline 
a CoD [Approximation) encoder and another CoD ( Unsupervised) encoder. Of- 
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Figure 3: Performance of CoD sparse encoders, trained under different training 
regimes, measured using tfie Lasso objective, as a function of sample number 
in tlie online learning experiment. Shown are the three groups of patches cor- 
responding to different texture images from the Brodatz dataset. 

fline training was performed on a distinct set of 6,000 patches extracted from 
the same images. 

Performance measured in terms of the Lasso objective is reported in Fig- 
ure [s] (high error corresponds to low performance). Initially, the performance of 
the online trained CoD (Unsupervised) encoder is slightly inferior to the offline 
trained counterpart; however, the online version starts performing better after 
the network parameters and the dictionary adapt to the current class of data. 
The CoD (Approximation) encoder trained offline exhibits the lowest perfor- 
mance. This experiment shows that, while the drop in performance compared 
to the exact Lasso is relatively low, the computational complexity of the online 
CoD ( Unsupervised) encoder is tremendously lower and fixed. 

6.2 Structured sparse encoders 

The performance of the BCoD structured sparse architecture derived from Al- 
gorithm [l] combined with different training regimes is evaluated on a speaker 
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Table 1: Speaker identification misclassification rates. 



Code Error rate 

Exact HiLasso 2.35% 
NN CoD {Approximation) 6.08% 
NN BCoD (Approximation) 3.53% 
NN BCoD (Discriminative) 3.44% 



identification task reproduced from |47) . In tfiis application the authors use 
hierarchical sparse coding to automatically detect the speakers in a given mixed 
signal. The dataset consists of recordings of five different radio speakers, two 
females and three males. 25% of the samples were used for training, and the rest 
for testing. Within the testing data, two sets of waveforms were created: one 
containing isolated speakers, and another containing all possible combinations 
of mixtures of two speakers. Signals were decomposed into a set of overlapping 
time frames of 512 samples with 75% overlap, such that the properties of the 
signal remain stable within each frame. An 80-dimensional feature vector is 
obtained for each audio frame as its short-time power spectrum envelope (refer 
to [17] for details). Five under-complete dictionaries with 50 atoms each were 
trained on the single speaker set minimizing the Lasso objective with A = 0.2 
(one dictionary per speaker), and then combined into a single structured dictio- 
nary containing 250 atoms. Increasing the dictionary size exhibited negligible 
performance benefits. Speaker identification was performed by first encoding a 
test vector in the structured dictionary and measuring the £2 energy of each of 
the five groups. Energies were sum-pooled over 500 time samples selecting the 
labels of the highest two. 

To assess the importance of the process architecture, a CoD (Approximation) 
and a BCoD (Approximation) encoders with T = 2 layers were trained offline 
to approximate the solution of the exact HiLasso, with A2 — 0.05 (regularizer 
parameter in the lower level of the hierarchy). A BCoD (Discriminative) encoder 
with the same settings was also trained in the supervised regime using the 
discriminative loss function ( 28 ) to promote or discourage the activation of 
groups corresponding to knowingly active or silent speakers respectively. 

Table [1] summarizes the obtained misclassification rates. In agreement with 
the experiments shown in Section |4.3[ using an appropriate structured archi- 
tecture instead of its unstructured counterpart with the same number of layers 
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and the same dictionary increases performance by nearly a factor of two. The 
use of the discriminative objective further improves performance. Surprisingly, 
using encoders with only two layers cedes just about 1% of correct classification 
rate. The structured architecture showed a crucial roll in producing accurate 
structured sparse codes. Other comparative experiments substantiating this 
observation can be found in |34j . 

6.3 Robust PCA encoders with geometric optimization 

In order to evaluate the performance of robust PCA encoders, we used a face 
dataset consisting of 800 66 x 48 images of a female face photographed over 
the timespan of 4.5 years, roughly pose- and scale-normalized and aligned. The 
images manifested a significant variety of hair and dressing styles, while keeping 
similar facial traits. Following [33 , we use RPCA to decompose a collection of 
faces represented as columns of the data matrix X into L-|-0, with the low- rank 
term L = DqS approximating the face identity (atoms of Dq can be thought of 
as "ei^en/aces"), while the outlier term O capturing the appearance variability. 

We trained a five layer RPCA ( Unsupervised) encoder on 600 images from 
the faces dataset. The dictionary was initialized using standard SVD and param- 
eters were set to q = 50, A* =0.1 and A = 10~^. To evaluate the representation 
capabilities of RPCA encoders in the presence of geometric transformations, as 
a test set, we used the remaining 200 faces, as well as a collection of geomet- 
rically transformed images from the same test set. Sub-pixel planar transla- 
tions were used for geometric transformations. The encoder was applied to the 
misaligned set, optimizing the unsupervised objective over the transformation 
parameters. For reference, the encoder was also applied to the transformed and 
the untransformed test sets without performing optimization. Examples of the 
obtained representations are visualized in Figure [4j Note the relatively larger 
magnitude and the bigger active set of the sparse outlier vector o produced 
for the misaligned faces, and how they are re-aligned when optimization over 
the transformation is allowed. Since the original data are only approximately 
aligned, performing optimal alignment during encoding frequently yields lower 
cost compared to the plain encoding of the original data. 
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Original Misaligned Optimally aligned 









0.8931 1.8981 1.4502 1.1015 2.8364 3.6438 3.6123 2.9722 0.8209 1.8518 1.3844 1.0884 

Figure 4: Robust PCA representation of the faces dataset in the presence of 
geometric transformations (misalignment). Left group: original faces; middle 
group: shifted faces; right group: faces optimally realigned during encoding. 
First row: reconstructed face Ds + o; middle row: its low-rank approximation 
Ds; and bottom row: sparse outlier o. RPCA loss is given for each representa- 
tion. 

6.4 Robust NMF encoders 
6.4.1 Singing voice separation 

We now evaluate the source separation problem (singing-voice/background- 



accompaniment), described in Section 4.4 The separation performance was 



evaluated on the MIR-IK dataset [IS], containing 1,000 Chinese karaoke clips 
performed by amateur singers. The experimental settings closely followed that 
of [48], to which the reader is referred for further details. As the evaluation 
criteria, we used the normalized source-to-distortion ratio (NSDR) from the 
BSS-EVAL metrics [32]: averaged over the test set. Encoders with RNMF ar- 
chitecture composed by T = 20 layers and q — 25 were trained using different 
training regimes. We used A = y/2na and A* — \f2a with g — 0.3 set following 
[13]. Tabled summarizes the obtained separation performance. While {Vn- 
supervised) training regime makes fast RNMF encoders on par with the exact 
RNMF (at a fraction of the computational complexity and latency of the latter), 
significant improvement is achieved by using the (Supervised) regime, where the 
encoders are trained to approximate the ground-truth separation over a reduced 
training set. For further details refer to [?5] . 
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Table 2: Audio separation quality {dB SDR). 



Method 


Voice 


Music 


Ideal frequency masking 


13.48 


12.51 


Exact RNMF 


5.60 


3.17 


RNMF (Untrained) 


1.62 


4.31 


RNMF ( Unsupervised) 


5.00 


4.01 


RNMF (Supervised) 


6.36 


4.32 



6.4.2 Robust speaker identification 

The purpose of this experiment is to show the benefits of training parsimonious 
models in a discriminative fashion when used for classification tasks. As an 
example we use speaker identification in environments heavily contaminated by 
unstructured noise described in Section 15.21 We evaluated the classification 
capabilities of different low rank NMF architectures in combination with two 
supervised training regimes discussed in Section [5] one aimed to produce a good 
reconstruction of the speech signal and another one optimized to produce the 
best classification. The reconstructive approach is analogous to the one used in 
Section 6.4.1 for music/singing- voice separation, but using a low rank NMF for 
both noise and human speech. In all our examples we used T = 10 layers and 



q = 50. Parameters A and A* were chosen as in Section 6.4.1 



As speech dataset we used a subset of the GRID dataset [501 containing 10 
distinct speakers; each speaker comprising 1,000 short clips. Three sets of 200 
distinct clips each were used for training, validation, and testing. The GRID 
clips were artificially contaminated by six categories of noise recorded from dif- 
ferent real environments (street, restaurant, car, exhibition, train, and airport) 
taken from the AURORA corpus [5T] . The voice and the noise clips were mixed 
linearly with equal energy (0 dB SNR). In all experiments, the spectrogram of 
each mixture was computed using a window of size 512 and a step size of 128 
samples (at 8 KHz sampling rate). For further details please refer to [45]. Ta- 
ble [3] summarizes the classification rates of the compared methods. While the 
results obtained using the low rank NMF (Supervised) are very good, they are 
significantly outperformed by the low rank NMF (Discriminative) encoders. 
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Table 3: Speaker identification success rate. 







RNMF Encoders 


Noise 


Exact 






(Supervised) 


( Discriminative ) 


street 


0.86 


0.91 


0.91 


restaurant 


0.91 


0.89 


0.90 


car 


0.90 


0.91 


0.96 


exhibition 


0.93 


0.91 


0.95 


train 


0.93 


0.88 


0.96 


airport 


0.92 


0.85 


0.98 


average 


0.91 


0.89 


0.94 



7 Conclusions and future work 

In this work wc have developed a comprehensive framework for process-centric 
parsimonious modeling. By combining ideas from convex optimization with 
multi-layer neural networks, we have shown how to produce deterministic func- 
tions capable of faithfully approximating the optimization-based solution of 
parsimonious models at a fraction of the computational time. Furthermore, 
at almost the same computational cost, the framework includes different objec- 
tive functions that allow the encoders to be trained in a discriminative fashion 
or solve challenging alignment problems. We conducted empirical experiments 
in different settings and real applications such as image modeling, robust face 
modeling, audio sources separation and robust speaker recognition. A simple 
unoptimized implementation already achieves often several order of magnitude 
speedups when compared to exact solvers. 

While we limited our attention to synthesis models, the proposed framework 
can be naturally extended to analysis cosparse models |52[ I53j , in which the 
signal is known to be sparse in a transformed domain. Specifically, given a 
"sensing" matrix M € M"^? and an analysis dictionary ft e Rp^™, in an 



analysis counterpart of (21), one looks for a function f £ J^, where again is 



a space of functions with certain desired properties, that minimizes 



1 ^ 



mm 2 E 11^' - M/(xi)f + A ||f2/(xi)||, . (30) 

i—l 



The space T can be set by truncating suitable iterative optimization algorithms 
such as the augmented Lagrangian methods of multipliers ( ADMM) [21] . 
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